
    Relational Playground: Teaching the Duality of Relational Algebra and SQL

    Full text link
    Students in introductory data management courses are often taught how to write queries in SQL. This is a useful and practical skill, but it gives limited insight into how queries are processed by relational database engines. In contrast, relational algebra is a common internal representation of queries in database engines, but it can be challenging for students to grasp. We developed a tool, Relational Playground, that lets database students explore the connection between relational algebra and SQL.
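    The duality the tool teaches can be illustrated with a minimal sketch in Python, where relations are lists of dicts and the algebra operators are plain functions (the table and column names below are illustrative, not from the tool):

```python
def select(relation, predicate):
    """sigma: keep the rows satisfying the predicate (SQL WHERE)."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """pi: keep only the named columns (SQL SELECT list)."""
    return [{c: row[c] for c in columns} for row in relation]

students = [
    {"name": "Ada", "year": 2},
    {"name": "Bob", "year": 4},
]

# SQL:  SELECT name FROM students WHERE year > 2
# RA:   pi_{name}(sigma_{year > 2}(students))
result = project(select(students, lambda r: r["year"] > 2), ["name"])
print(result)  # [{'name': 'Bob'}]
```

    Reading the nested function calls inside-out mirrors how an engine evaluates the algebra tree bottom-up, which is exactly the correspondence the SQL syntax hides.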

    JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

    Full text link
    Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON data sets often do not have associated structural information. Consumers of such datasets are often left to browse through data in an attempt to observe commonalities in structure across documents to construct suitable code for data processing. However, this process is time-consuming and error-prone. Existing distributed approaches to mining schemas present a significant usability advantage, as they provide useful metadata for large data sources. However, depending on the data source, ad hoc queries for estimating other properties to help with crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery with similar performance. Our approach also adds significant useful additional information about data values to discovered schemas with linear scalability.
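    The monoid idea can be sketched as follows (a hypothetical simplification, not JSONoid's actual structures): each schema property is a value with an identity element and an associative combine operation, so partial summaries from different partitions can be merged in any order or grouping.

```python
from functools import reduce

def unit():
    # identity element: the summary of zero documents
    return {"count": 0, "keys": set(), "max_key_len": 0}

def combine(a, b):
    # associative merge of two partial summaries
    return {
        "count": a["count"] + b["count"],
        "keys": a["keys"] | b["keys"],
        "max_key_len": max(a["max_key_len"], b["max_key_len"]),
    }

def summarize(doc):
    # lift a single JSON object into the monoid
    return {"count": 1, "keys": set(doc), "max_key_len": max(map(len, doc), default=0)}

docs = [{"id": 1, "name": "a"}, {"id": 2, "email": "x@y.z"}]
summary = reduce(combine, map(summarize, docs), unit())
print(summary["count"], sorted(summary["keys"]))  # 2 ['email', 'id', 'name']
```

    Because combine is associative with unit() as identity, a distributed engine can fold partitions independently and merge the results, which is what makes such summaries cheap to maintain at scale.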

    Comprehending Semantic Types in JSON Data with Graph Neural Networks

    Full text link
    Semantic types are a more powerful and detailed way of describing data than atomic types such as strings or integers. They establish connections between columns and concepts from the real world, providing more nuanced and fine-grained information that can be useful for tasks such as automated data cleaning, schema matching, and data discovery. Existing deep learning models trained on large text corpora have been successful at performing single-column semantic type prediction for relational data. In this work, we propose an extension of the semantic type prediction problem to JSON data, labeling the types based on JSON Paths. JSON Path is a query language that enables navigation of complex JSON data structures by specifying the location and content of elements; JSON Paths play a role analogous to columns in relational data. We use a graph neural network to comprehend the structural information within collections of JSON documents. Our model outperforms a state-of-the-art existing model in several cases. These results demonstrate the ability of our model to understand complex JSON data and its potential usage for JSON-related data processing tasks.
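    The role JSON Paths play as labels can be seen in a small sketch that enumerates the paths of a nested document (the path notation below is a simplified JSONPath variant with array indices collapsed to `[*]`; the document is made up):

```python
def json_paths(value, prefix="$"):
    """Yield (path, leaf_value) pairs for every leaf in a JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from json_paths(child, f"{prefix}.{key}")
    elif isinstance(value, list):
        for child in value:          # collapse array indices into [*]
            yield from json_paths(child, f"{prefix}[*]")
    else:
        yield prefix, value

doc = {"name": "Ada", "address": {"city": "London"}, "phones": [{"number": "555"}]}
paths = sorted({p for p, _ in json_paths(doc)})
print(paths)
# ['$.address.city', '$.name', '$.phones[*].number']
```

    Each such path would receive a semantic type label (e.g. person name, city, phone number), just as each column does in the relational version of the problem.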

    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Get PDF
    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization. Comment: SIGMOD'1
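    The rule-based optimizer architecture can be sketched with a toy example (this is not Calcite's actual API; node and rule names are invented): a rewrite rule pattern-matches on a fragment of the logical plan and returns an equivalent, cheaper fragment, here pushing a Filter below a Project.

```python
class Node:
    """A toy logical plan node: an operator, one child, and attributes."""
    def __init__(self, op, child=None, **attrs):
        self.op, self.child, self.attrs = op, child, attrs

def push_filter_below_project(node):
    # Filter(Project(x)) -> Project(Filter(x)), legal when the filter only
    # references columns that the Project keeps.
    if (node.op == "Filter" and node.child is not None
            and node.child.op == "Project"
            and set(node.attrs["cols"]) <= set(node.child.attrs["cols"])):
        project = node.child
        new_filter = Node("Filter", project.child, **node.attrs)
        return Node("Project", new_filter, **project.attrs)
    return node

scan = Node("Scan", table="emp")
plan = Node("Filter", Node("Project", scan, cols=["name", "dept"]),
            cols=["dept"], pred="dept = 10")
optimized = push_filter_below_project(plan)
print(optimized.op, "->", optimized.child.op, "->", optimized.child.child.op)
# Project -> Filter -> Scan
```

    A real optimizer applies hundreds of such rules, driven by a cost model, until the plan reaches a fixed point; the modularity comes from each rule being an independent, local transformation.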

    Physical Design for Non-relational Data Systems

    Get PDF
    Decades of research have gone into the optimization of physical designs, query execution, and related tools for relational databases. These techniques and tools make it possible for non-expert users to make effective use of relational database management systems. However, the drive for flexible data models and increased scalability has spawned a new generation of data management systems which largely eschew the relational model. These include NoSQL databases and distributed analytics frameworks such as Apache Spark, which make use of a diverse set of data models. Optimization techniques and tools developed for relational data do not directly apply in this setting, leaving developers who use these systems needing to become intimately familiar with system details to obtain good performance. We present techniques and tools for physical design for non-relational data systems. We explore two settings: NoSQL database systems and distributed analytics frameworks. While NoSQL databases often avoid explicit schema definitions, many choices on how to structure data remain, and these choices can have a significant impact on application performance. The data structuring process normally requires expert knowledge of the underlying database. We present the NoSQL Schema Evaluator (NoSE): given a target workload, NoSE produces an optimized physical design for NoSQL database applications which compares favourably to schemas designed by expert users. To enable existing applications to benefit from conceptual modeling, we also present an algorithm to recover a logical model from a denormalized database instance. Our second setting is distributed analytics frameworks such as Apache Spark. As with NoSQL databases, expert knowledge of Spark is often required to construct efficient data pipelines. In NoSQL systems, a key challenge is how to structure stored data; in Spark, a key challenge is how to cache intermediate results.
We examine a particularly common scenario in Spark: performing iterative analysis on an input dataset. We show that jobs written in an intuitive manner using existing Spark APIs can have poor performance. We propose ReSpark, which automates caching decisions for iterative Spark analyses. Like NoSE, ReSpark makes it possible for non-expert users to obtain good performance from a non-relational data system.
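    The caching problem ReSpark automates can be sketched with a hypothetical lazy dataset (not ReSpark's or Spark's actual API): in a lazy pipeline, an intermediate result that is reused across iterations is recomputed on every iteration unless it is explicitly cached.

```python
class Dataset:
    """Toy stand-in for a lazily evaluated distributed dataset."""
    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self.computations = 0   # count recomputations to expose the cost

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        return self._compute()

    def cache(self):
        # materialize once; later collect() calls reuse the result
        self._cached = self._compute()
        self.computations += 1
        return self

# Iterative analysis without caching: the input is recomputed every iteration.
cleaned = Dataset(lambda: [x * 2 for x in range(5)])
for _ in range(3):
    sum(cleaned.collect())
print(cleaned.computations)  # 3

# With caching, the expensive intermediate is computed exactly once.
cleaned = Dataset(lambda: [x * 2 for x in range(5)]).cache()
for _ in range(3):
    sum(cleaned.collect())
print(cleaned.computations)  # 1
```

    Deciding *which* intermediates to cache is the hard part in real pipelines, since caching everything exhausts memory; automating that trade-off is the gap the tool addresses.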

    NoSQL Schema Design for Time-Dependent Workloads

    Full text link
    In this paper, we propose a schema optimization method for time-dependent workloads in NoSQL databases. Our method migrates the schema as the workload changes; the estimated costs of query execution and of migration are formulated and minimized together as a single integer linear programming problem. Furthermore, we propose a method to reduce the number of optimization candidates by iteratively abstracting the time dimension and optimizing the workload while updating constraints.
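    The objective being minimized can be illustrated with a brute-force sketch (all costs and schema names are made up, and a real formulation would use an ILP solver rather than enumeration): choose one schema per time step so that total execution cost plus migration cost between adjacent steps is minimal.

```python
from itertools import product

schemas = ["denormalized", "normalized"]
# exec_cost[t][s]: cost of serving the workload at time step t under schema s
exec_cost = [
    {"denormalized": 1, "normalized": 5},   # read-heavy period
    {"denormalized": 6, "normalized": 2},   # write-heavy period
    {"denormalized": 6, "normalized": 2},
]
MIGRATION = 3   # flat cost to switch schemas between adjacent steps

def total_cost(plan):
    run = sum(exec_cost[t][s] for t, s in enumerate(plan))
    moves = sum(MIGRATION for a, b in zip(plan, plan[1:]) if a != b)
    return run + moves

best = min(product(schemas, repeat=len(exec_cost)), key=total_cost)
print(best, total_cost(best))
# ('denormalized', 'normalized', 'normalized') 8
```

    The ILP version replaces the enumeration with binary decision variables per (time step, schema) pair, which is what keeps the search tractable as the number of candidates grows; abstracting the time dimension further shrinks the variable count.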

    The attrition rate of licensed chiropractors in California: an exploratory ecological investigation of time-trend data

    Get PDF
    BACKGROUND: The authors hypothesized the attrition rate of licensed chiropractors in California has gradually increased over the past several decades. "Attrition" as determined for this study is defined as a loss of legal authority to practice chiropractic for any reason during the first 10 years after the license was issued. The percentage of license attrition after 10 years was determined for each group of graduates licensed in California each year between 1970 and 1998. The cost of tuition, the increase in the supply of licensed chiropractors and the ratio of licensed chiropractors to California residents were examined as possible influences on the rate of license attrition. METHODS: The attrition rate was determined by a retrospective analysis of license status data obtained from the California Department of Consumer Affairs. Other variables were determined from US Bureau of Census data, survey data from the American Chiropractic Association and catalogs from a US chiropractic college. RESULTS: The 10-year attrition rate rose from 10% for those graduates licensed in 1970 to a peak of 27.8% in 1991. The 10-year attrition rate has since remained between 20-25% for the doctors licensed between 1992-1998. CONCLUSIONS: Available evidence supports the hypothesis that the attrition rate for licensed chiropractors in the first 10 years of practice has risen in the past several decades.

    A united statement of the global chiropractic research community against the pseudoscientific claim that chiropractic care boosts immunity.

    Get PDF
    BACKGROUND: In the midst of the coronavirus pandemic, the International Chiropractors Association (ICA) posted reports claiming that chiropractic care can impact the immune system. These claims clash with recommendations from the World Health Organization and World Federation of Chiropractic. We discuss the scientific validity of the claims made in these ICA reports. MAIN BODY: We reviewed the two reports posted by the ICA on their website on March 20 and March 28, 2020. We explored the method used to develop the claim that chiropractic adjustments impact the immune system and discuss the scientific merit of that claim. We provide a response to the ICA reports and explain why this claim lacks scientific credibility and is dangerous to the public. More than 150 researchers from 11 countries reviewed and endorsed our response. CONCLUSION: In their reports, the ICA provided no valid clinical scientific evidence that chiropractic care can impact the immune system. We call on regulatory authorities and professional leaders to take robust political and regulatory action against those claiming that chiropractic adjustments have a clinical impact on the immune system.

    Tropical Data: Approach and Methodology as Applied to Trachoma Prevalence Surveys

    Get PDF
    PURPOSE: Population-based prevalence surveys are essential for decision-making on interventions to achieve trachoma elimination as a public health problem. This paper outlines the methodologies of Tropical Data, which supports work to undertake those surveys. METHODS: Tropical Data is a consortium of partners that supports health ministries worldwide to conduct globally standardised prevalence surveys that conform to World Health Organization recommendations. Founding principles are health ministry ownership, partnership and collaboration, and quality assurance and quality control at every step of the survey process. Support covers survey planning, survey design, training, electronic data collection and fieldwork, and data management, analysis and dissemination. Methods are adapted to meet local context and needs. Customisations, operational research and integration of other diseases into routine trachoma surveys have also been supported. RESULTS: Between 29th February 2016 and 24th April 2023, 3373 trachoma surveys across 50 countries have been supported, resulting in 10,818,502 people being examined for trachoma. CONCLUSION: This health ministry-led, standardised approach, with support from the start to the end of the survey process, has helped all trachoma elimination stakeholders to know where interventions are needed, where interventions can be stopped, and when elimination as a public health problem has been achieved. Flexibility to meet specific country contexts, adaptation to changes in global guidance and adjustments in response to user feedback have facilitated innovation in evidence-based methodologies, and supported health ministries to strive for global disease control targets.